52 research outputs found

    Automatically Discovering the Number of Clusters in Web Page Datasets

    Get PDF
    Clustering is well suited for Web mining: it automatically organizes Web pages into categories, each of which contains Web pages with similar contents. However, one problem in clustering is the lack of general methods for automatically determining the number of categories or clusters; for the Web domain in particular, no such method currently exists that is suitable for Web page clustering. To address this problem, we discover a constant factor that characterizes the Web domain, based on which we propose a new method for automatically determining the number of clusters in Web page data sets. We find that the average inter-cluster similarity reaches a constant value of 1.7 whenever our experiments produce the best clustering results for Web pages. We therefore use this constant as the stopping factor in our clustering process: individual Web pages are arranged into clusters, those clusters into larger clusters, and so on, until the average inter-cluster similarity approaches the constant, at which point the number of clusters is determined. Combining the method described in this paper with our Bidirectional Hierarchical Clustering algorithm reported elsewhere, we have developed a clustering system suitable for mining the Web.
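
    To make the stopping rule concrete, here is a minimal sketch of agglomerative merging driven by such a constant. This is not the authors' Bidirectional Hierarchical Clustering algorithm; the cosine similarity and page vectors are stand-in assumptions (the value 1.7 belongs to the paper's own similarity measure, so the threshold below is only a placeholder).

        # Agglomerative clustering that merges the most similar pair of clusters
        # until the average inter-cluster similarity approaches a fixed stopping
        # constant. Cosine similarity is an illustrative stand-in for the paper's
        # Web-specific measure, so STOP_CONSTANT is a placeholder here.
        import numpy as np

        STOP_CONSTANT = 1.7  # the paper's domain constant (measure-specific)

        def avg_similarity(a, b):
            """Average pairwise cosine similarity between two clusters of vectors."""
            sims = [v @ w / (np.linalg.norm(v) * np.linalg.norm(w))
                    for v in a for w in b]
            return sum(sims) / len(sims)

        def cluster_pages(pages, stop=STOP_CONSTANT):
            clusters = [[p] for p in pages]  # start with one page per cluster
            while len(clusters) > 1:
                pairs = [(i, j) for i in range(len(clusters))
                         for j in range(i + 1, len(clusters))]
                sims = {pair: avg_similarity(clusters[pair[0]], clusters[pair[1]])
                        for pair in pairs}
                if sum(sims.values()) / len(sims) <= stop:
                    break  # average inter-cluster similarity reached the constant
                i, j = max(sims, key=sims.get)  # merge the most similar pair
                clusters[i] += clusters.pop(j)
            return clusters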

    A Survey on Differential Privacy with Machine Learning and Future Outlook

    Full text link
    Nowadays, machine learning models and applications have become increasingly pervasive. With this rapid increase in the development and deployment of machine learning models, concerns regarding privacy have risen, and there is a legitimate need to protect data from leakage and from attacks. One of the strongest and most prevalent privacy models that can protect machine learning models from attacks and vulnerabilities is differential privacy (DP). DP is a strict and rigorous definition of privacy that guarantees an adversary cannot reliably determine whether a specific participant is included in the dataset. It works by injecting noise into the data, whether into the inputs, the outputs, the ground-truth labels, the objective function, or even the gradients, to alleviate the privacy risk and protect the data. To this end, this survey paper presents differentially private machine learning algorithms categorized into two main groups (traditional machine learning models vs. deep learning models). Moreover, future research directions for differential privacy with machine learning algorithms are outlined.
    Comment: 12 pages, 3 figures
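
    As one concrete instance of the noise-injection idea, here is a minimal sketch of the Laplace mechanism, a standard DP building block; the counting query and its sensitivity of 1 are illustrative assumptions, not details taken from the survey.

        # Minimal sketch of the Laplace mechanism, the classic way to make a
        # numeric query epsilon-differentially private: add noise scaled to
        # the query's sensitivity divided by the privacy budget epsilon.
        import numpy as np

        def laplace_mechanism(true_value, sensitivity, epsilon, rng=np.random.default_rng()):
            """Return true_value perturbed with Laplace(sensitivity / epsilon) noise."""
            return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

        # Example: privately release how many records satisfy a predicate.
        # Adding or removing one record changes the count by at most 1,
        # so this query's sensitivity is 1.
        records = np.array([23, 45, 18, 67, 52, 31])
        true_count = int(np.sum(records > 30))
        private_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5)
        print(true_count, private_count)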

    Node Isolation Model and Age-Based Neighbor Selection in Unstructured P2P Networks

    Get PDF
    Previous analytical studies of unstructured P2P resilience have assumed exponential user lifetimes and only considered age-independent neighbor replacement. In this paper, we overcome these limitations by introducing a general node-isolation model for heavy-tailed user lifetimes and arbitrary neighbor-selection algorithms. Using this model, we analyze two age-biased neighbor-selection strategies and show that they significantly improve the residual lifetimes of chosen users, which dramatically reduces the probability of user isolation and graph partitioning compared with uniform selection of neighbors. In fact, the second strategy, based on random walks on age-proportional graphs, demonstrates that, for lifetimes with infinite variance, the system monotonically increases its resilience as its age and size grow. Specifically, we show that the probability of isolation converges to zero as these two metrics tend to infinity. We finish the paper with simulations in finite-size graphs that demonstrate the effect of this result in practice.
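
    A minimal sketch of one age-biased selection rule, as an illustration only: among k uniformly sampled live peers, choose the one with the largest current age. This max-age-of-k rule is our stand-in, not the paper's exact strategies (nor its age-proportional random-walk construction), but it captures why age bias helps: under heavy-tailed lifetimes, older peers have stochastically longer residual lifetimes.

        # Age-biased neighbor selection sketch: among k uniformly sampled live
        # peers, pick the one that has been online the longest. Under
        # heavy-tailed (Pareto) lifetimes, the oldest candidate tends to have
        # the longest remaining (residual) lifetime. Illustrative only.
        import random

        def pareto_lifetime(alpha=1.1, xmin=1.0):
            """Heavy-tailed lifetime with P(L > x) = (xmin / x)**alpha, x >= xmin."""
            return xmin / random.random() ** (1.0 / alpha)

        def select_neighbor(join_times, now, k=8):
            """join_times: dict peer_id -> join time. Return the oldest of k candidates."""
            candidates = random.sample(list(join_times), k)
            return max(candidates, key=lambda p: now - join_times[p])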

    Unstructured P2P Link Lifetimes Redux

    Get PDF
    We revisit link lifetimes in random P2P graphs under dynamic node failure and create a unifying stochastic model that generalizes the majority of previous efforts in this direction. We not only allow non-exponential user lifetimes and age-dependent neighbor selection, but also cover both active and passive neighbor-management strategies, model the lifetimes of incoming and outgoing links, derive the churn-related message volume of the system, and obtain the distribution of transient in/out degree at each user. We then discuss the impact of design parameters on the overhead and resilience of the network.

    On Node Isolation under Churn in Unstructured P2P Networks with Heavy-Tailed Lifetimes

    Get PDF
    Previous analytical studies [12], [18] of unstructured P2P resilience have assumed exponential user lifetimes and only considered age-independent neighbor replacement. In this paper, we overcome these limitations by introducing a general node-isolation model for heavy-tailed user lifetimes and arbitrary neighbor-selection algorithms. Using this model, we analyze two age-biased neighbor-selection strategies and show that they significantly improve the residual lifetimes of chosen users, which dramatically reduces the probability of user isolation and graph partitioning compared to uniform selection of neighbors. In fact, the second strategy, based on random walks on age-weighted graphs, demonstrates that, for lifetimes with infinite variance, the system monotonically increases its resilience as its age and size grow. Specifically, we show that the probability of isolation converges to zero as these two metrics tend to infinity. We finish the paper with simulations in finite-size graphs that demonstrate the effect of this result in practice.

    Residual-Based Estimation of Peer and Link Lifetimes in P2P Networks

    Get PDF
    Existing methods of measuring lifetimes in P2P systems usually rely on the so-called Create-Based Method (CBM), which divides a given observation window into two halves and samples users "created" in the first half every Δ time units until they die or the observation period ends. Despite its frequent use, this approach has no rigorous accuracy or overhead analysis in the literature. To shed more light on its performance, we first derive a model for CBM and show that a small window size or a large Δ may lead to highly inaccurate lifetime distributions. We then show that create-based sampling exhibits an inherent tradeoff between overhead and accuracy, which does not allow any fundamental improvement to the method. Instead, we propose a completely different approach for sampling user dynamics that keeps track of only the residual lifetimes of peers and uses a simple renewal-process model to recover the actual lifetimes from the observed residuals. Our analysis indicates that for reasonably large systems, the proposed method can reduce bandwidth consumption by several orders of magnitude compared to prior approaches while simultaneously achieving higher accuracy. We finish the paper by implementing a two-tier Gnutella network crawler equipped with the proposed sampling method and obtain the distribution of ultrapeer lifetimes in a network of 6.4 million users and 60 million links. Our experimental results show that ultrapeer lifetimes are Pareto with shape α ≈ 1.1; however, link lifetimes exhibit much lighter tails with α ≈ 1.8.
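
    The renewal-theoretic step that recovers lifetimes from residuals can be made explicit. The identity below is the standard equilibrium (inspection-paradox) relation from renewal theory, written out as our own gloss rather than a formula quoted from the paper: if lifetimes L have CDF F_L and finite mean, the residual lifetime R sampled from a stationary system satisfies

        % Equilibrium (residual-lifetime) distribution of a renewal process:
        F_R(x) \;=\; \frac{1}{\mathbb{E}[L]} \int_0^x \bigl(1 - F_L(t)\bigr)\, dt,
        % hence the lifetime CDF is recovered from measured residuals by
        F_L(x) \;=\; 1 - \mathbb{E}[L]\, F_R'(x).

    Intuitively, sampling only residuals means a peer need not be probed repeatedly until it dies, which is consistent with the bandwidth savings the paper reports.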

    Residual-Based Measurement of Peer and Link Lifetimes in Gnutella Networks

    Get PDF
    Existing methods of measuring lifetimes in P2P systems usually rely on the so-called create-based method (CBM), which divides a given observation window into two halves and samples users created in the first half every Δ time units until they die or the observation period ends. Despite its frequent use, this approach has no rigorous accuracy or overhead analysis in the literature. To shed more light on its performance, we first derive a model for CBM and show that a small window size or a large Δ may lead to highly inaccurate lifetime distributions. We then show that create-based sampling exhibits an inherent tradeoff between overhead and accuracy, which does not allow any fundamental improvement to the method. Instead, we propose a completely different approach for sampling user dynamics that keeps track of only the residual lifetimes of peers and uses a simple renewal-process model to recover the actual lifetimes from the observed residuals. Our analysis indicates that for reasonably large systems, the proposed method can reduce bandwidth consumption by several orders of magnitude compared to prior approaches while simultaneously achieving higher accuracy. We finish the paper by implementing a two-tier Gnutella network crawler equipped with the proposed sampling method and obtain the distribution of ultrapeer lifetimes in a network of 6.4 million users and 60 million links. Our experimental results show that ultrapeer lifetimes are Pareto with shape α ≈ 1.1; however, link lifetimes exhibit much lighter tails with α ≈ 1.9.
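
    For context on how such shape values are obtained from measured lifetimes, here is a minimal sketch of the textbook maximum-likelihood estimator for the Pareto shape parameter; this is the standard estimator, not necessarily the fitting procedure used in the paper.

        # Maximum-likelihood estimator for the Pareto shape alpha from a sample
        # of lifetimes: alpha_hat = n / sum(log(x_i / x_min)), with the scale
        # x_min estimated by the sample minimum.
        import math

        def pareto_shape_mle(lifetimes):
            xmin = min(lifetimes)  # scale parameter from the sample minimum
            n = len(lifetimes)
            return n / sum(math.log(x / xmin) for x in lifetimes)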

    Understanding Churn in Decentralized Peer-to-Peer Networks

    Get PDF
    This dissertation presents a novel modeling framework for understanding the dynamics of peer-to-peer (P2P) networks under churn (i.e., random user arrival/departure) and designing systems more resilient against node failure. The proposed models are applicable to general distributed systems under a variety of conditions on graph construction and user lifetimes. The foundation of this work is a new churn model that describes user arrival and departure as a superposition of many periodic (renewal) processes. It not only allows general (non-exponential) user lifetime distributions, but also captures the heterogeneous behavior of peers. We utilize this model to analyze link dynamics and the ability of the system to stay connected under churn. Our results offer exact computation of user-isolation and graph-partitioning probabilities for any monotone lifetime distribution, including the heavy-tailed cases found in real systems. We also propose an age-proportional random-walk algorithm for creating links in unstructured P2P networks that achieves zero isolation probability as system size becomes infinite. We additionally obtain many insightful results on the transient distribution of in-degree, the edge arrival process, system size, and the lifetimes of live users as simple functions of the aggregate lifetime distribution. The second half of this work studies churn in structured P2P networks, which are usually built upon distributed hash tables (DHTs). Users in DHTs maintain two types of neighbor sets: routing tables and successor/leaf sets. The former determine link lifetimes and the routing performance of the system, while the latter ensure DHT consistency and connectivity. Our first result in this area proves that the robustness of DHTs is mainly determined by the zone size of selected neighbors, which leads us to propose a min-zone algorithm that significantly reduces link churn in DHTs. Our second result uses the Chen-Stein method to understand concurrent failures among the strongly dependent successor sets of many DHTs and finds an optimal stabilization strategy for keeping Chord connected under churn.
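
    A minimal sketch of the framework's central modeling idea, aggregate churn as a superposition of per-user alternating renewal processes; the Pareto lifetimes and exponential offline periods below are illustrative choices, not the dissertation's specific assumptions.

        # Churn as a superposition of per-user alternating renewal processes:
        # each user cycles through online (lifetime) and offline periods drawn
        # from arbitrary distributions. Pareto(1.1) lifetimes and exponential
        # offline times are illustrative choices only.
        import heapq, random

        def simulate_churn(n_users=1000, horizon=10_000.0,
                           lifetime=lambda: 1.0 / random.random() ** (1 / 1.1),
                           offtime=lambda: random.expovariate(0.1)):
            """Track the number of online users under alternating renewal churn."""
            # All users start offline, with randomized initial arrivals to
            # desynchronize the individual renewal processes.
            events = [(random.uniform(0, 100), u, "on") for u in range(n_users)]
            heapq.heapify(events)
            online, trace = 0, []
            while events:
                t, u, state = heapq.heappop(events)
                if t > horizon:
                    break
                if state == "on":    # arrival: stay online for one lifetime
                    online += 1
                    heapq.heappush(events, (t + lifetime(), u, "off"))
                else:                # departure: return after an offline period
                    online -= 1
                    heapq.heappush(events, (t + offtime(), u, "on"))
                trace.append((t, online))
            return trace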
